For this mini-project I analyze grocery stores in Detroit, using the “Grocery Stores” dataset that I chose from the U.S. Government’s open data.
The dataset contains information on stores located in the city of Detroit and is available via the Data Driven Open Data portal. I intend to investigate patterns and trends in this data, such as the types of grocery stores and their locations.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 0.3.4
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.2.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(sf)
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
Grocery_stores <- read_csv("C:/github/dataviz_final_project/data/Grocery.csv")
## Rows: 123 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): Company, Address, City, State, Common_Name, Notes, PHONE, FAX, EMA...
## dbl (8): OBJECTID, ZipCode, Better_Lat, Better_Long, SquareFeet, Centroid_X...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Grocery_stores
## # A tibble: 123 × 20
## OBJECTID Company Address City State ZipCode Better_Lat Better_Long
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 NAFSU ENTERPRISE… 10320 … DETR… MI 48204 42.4 -83.2
## 2 2 RED FOX FOODLAND 10333 … DETR… MI 48238 42.4 -83.2
## 3 3 X Z INC 11100 … DETR… MI 48214 42.4 -83.0
## 4 4 UNIVERSITY FOOD … 1131 W… DETR… MI 48201 42.4 -83.1
## 5 5 SAVE A LOT 11825 … DETR… MI 48203 42.4 -83.1
## 6 6 MR CS SUPERMARKE… 12055 … DETR… MI 48224 42.4 -83.0
## 7 7 BISHR POULTRY & … 12300 … DETR… MI 48212 42.4 -83.1
## 8 8 GOLDEN BENGAL SE… 12500 … DETR… MI 48212 42.4 -83.1
## 9 9 BASHAR & MARK BR… 12740 … DETR… MI 48205 42.4 -83.0
## 10 10 GRAND PRICE INC 12955 … DETR… MI 48227 42.4 -83.2
## # ℹ 113 more rows
## # ℹ 12 more variables: SquareFeet <dbl>, Common_Name <chr>, Notes <chr>,
## # PHONE <chr>, FAX <chr>, EMAIL <chr>, WEBSITE <chr>, DIG_MEMBER <chr>,
## # Data_Source <chr>, Centroid_X <dbl>, Centroid_Y <dbl>, ORIG_FID <dbl>
The dataset provides information about grocery stores in Detroit. read_csv() loads it into a data frame stored in the variable “Grocery_stores”, which has 123 rows and 20 columns. The columns describe each store: the company name, address, city, state, ZIP code, latitude and longitude, store size in square feet, a common name, notes, and contact information (phone, fax, email, and website).
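Before summarizing, it can help to count the missing values in each column, since gaps in the coordinate columns affect the map later in this report. The sketch below is only an illustration: `toy_stores` and its values are made up and stand in for `Grocery_stores`.

```r
library(dplyr)

# Hypothetical stand-in for Grocery_stores; the values are made up
toy_stores <- tibble(
  Company    = c("A Market", "B Foods", "C Grocery"),
  Better_Lat = c(42.37, NA, 42.40),
  SquareFeet = c(15879, 16131, NA)
)

# Count the missing values in every column
na_counts <- toy_stores %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The same `summarize(across(...))` call could be applied to the real data frame to see at a glance which columns are incomplete.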
summary(Grocery_stores)
## OBJECTID Company Address City
## Min. : 1.00 Length:123 Length:123 Length:123
## 1st Qu.: 31.50 Class :character Class :character Class :character
## Median : 62.00 Mode :character Mode :character Mode :character
## Mean : 62.02
## 3rd Qu.: 92.50
## Max. :124.00
##
## State ZipCode Better_Lat Better_Long
## Length:123 Min. :48201 Min. :42.31 Min. :-83.24
## Class :character 1st Qu.:48209 1st Qu.:42.34 1st Qu.:-83.18
## Mode :character Median :48215 Median :42.37 Median :-83.10
## Mean :48217 Mean :42.37 Mean :-83.11
## 3rd Qu.:48227 3rd Qu.:42.40 3rd Qu.:-83.06
## Max. :48238 Max. :42.45 Max. :-82.95
## NA's :64 NA's :64
## SquareFeet Common_Name Notes PHONE
## Min. : 817 Length:123 Length:123 Length:123
## 1st Qu.: 8386 Class :character Class :character Class :character
## Median :14122 Mode :character Mode :character Mode :character
## Mean :17434
## 3rd Qu.:21557
## Max. :90709
## NA's :3
## FAX EMAIL WEBSITE DIG_MEMBER
## Length:123 Length:123 Length:123 Length:123
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Data_Source Centroid_X Centroid_Y ORIG_FID
## Length:123 Min. :-83.28 Min. :42.28 Min. : 1.00
## Class :character 1st Qu.:-83.17 1st Qu.:42.34 1st Qu.: 30.75
## Mode :character Median :-83.11 Median :42.38 Median : 60.50
## Mean :-83.10 Mean :42.38 Mean : 60.50
## 3rd Qu.:-83.03 3rd Qu.:42.41 3rd Qu.: 90.25
## Max. :-82.93 Max. :42.45 Max. :120.00
## NA's :3 NA's :3 NA's :3
str(Grocery_stores)
## spc_tbl_ [123 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ OBJECTID : num [1:123] 1 2 3 4 5 6 7 8 9 10 ...
## $ Company : chr [1:123] "NAFSU ENTERPRISES INC" "RED FOX FOODLAND" "X Z INC" "UNIVERSITY FOOD CENTER INC" ...
## $ Address : chr [1:123] "10320 PLYMOUTH RD" "10333 FENKELL ST" "11100 MACK AVE" "1131 W WARREN AVE" ...
## $ City : chr [1:123] "DETROIT" "DETROIT" "DETROIT" "DETROIT" ...
## $ State : chr [1:123] "MI" "MI" "MI" "MI" ...
## $ ZipCode : num [1:123] 48204 48238 48214 48201 48203 ...
## $ Better_Lat : num [1:123] 42.4 42.4 42.4 42.4 42.4 ...
## $ Better_Long: num [1:123] -83.2 -83.2 -83 -83.1 -83.1 ...
## $ SquareFeet : num [1:123] 15879 16131 14160 22101 16124 ...
## $ Common_Name: chr [1:123] "Shop A Lot Food Center" "Red Fox Foodland" NA "University Foods" ...
## $ Notes : chr [1:123] NA NA "Closed. Building for sale and phone off. - RL" NA ...
## $ PHONE : chr [1:123] NA NA NA "(313) 833-0815" ...
## $ FAX : chr [1:123] NA NA NA "(313) 833-5648" ...
## $ EMAIL : chr [1:123] NA NA NA "nyaldoo@spartanstores.com" ...
## $ WEBSITE : chr [1:123] NA NA NA "http://universityfoodsmidtown.com/" ...
## $ DIG_MEMBER : chr [1:123] NA NA NA "Yes" ...
## $ Data_Source: chr [1:123] "NETS/Devries" "NETS/Devries" "NETS/Devries" "NETS/Devries" ...
## $ Centroid_X : num [1:123] -83.2 -83.2 -83 -83.1 -83.1 ...
## $ Centroid_Y : num [1:123] 42.4 42.4 42.4 42.4 42.4 ...
## $ ORIG_FID : num [1:123] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "spec")=
## .. cols(
## .. OBJECTID = col_double(),
## .. Company = col_character(),
## .. Address = col_character(),
## .. City = col_character(),
## .. State = col_character(),
## .. ZipCode = col_double(),
## .. Better_Lat = col_double(),
## .. Better_Long = col_double(),
## .. SquareFeet = col_double(),
## .. Common_Name = col_character(),
## .. Notes = col_character(),
## .. PHONE = col_character(),
## .. FAX = col_character(),
## .. EMAIL = col_character(),
## .. WEBSITE = col_character(),
## .. DIG_MEMBER = col_character(),
## .. Data_Source = col_character(),
## .. Centroid_X = col_double(),
## .. Centroid_Y = col_double(),
## .. ORIG_FID = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
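# Summarize the data: count the stores for each company
store_summary <- Grocery_stores %>%
  group_by(Company) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  # Note: top_n() keeps ties. Almost every company has one store,
  # so the tie at Count = 1 keeps every company, not just ten.
  top_n(10)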
## Selecting by Count
store_summary
## # A tibble: 122 × 2
## Company Count
## <chr> <int>
## 1 Aldi Food Store 2
## 2 AMBASSADOR MARKET 1
## 3 Americana Foods 1
## 4 Apollo Market Place 1
## 5 Atlas Market 1
## 6 Azteca Supermercado 1
## 7 BARAKAH GROCERY 1
## 8 BASHAR & MARK BROTHERS MKT INC 1
## 9 BILLYS PRODUCE INC 1
## 10 BISHR POULTRY & FOOD CENTER 1
## # ℹ 112 more rows
# Sort the data by count in descending order
store_summary <- store_summary %>% arrange(desc(Count))
# Create an interactive bar plot
plot_ly(data = store_summary, x = ~Count, y = ~Company, type = "bar", orientation = "h") %>%
  layout(
    title = "Number of Stores by Company",
    xaxis = list(title = "Count"),
    yaxis = list(title = "Company"),
    plot_bgcolor = "#f2f2f2",
    paper_bgcolor = "#f2f2f2",
    font = list(color = "black")
  )
This plot presents the distribution of stores across companies, making it easy to see which companies have the most and the fewest stores.
The original plan was to create a bar chart showing the number of stores for each company in the dataset. To prepare the data, the code groups the rows by company and counts the stores for each one.
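Because `top_n()` keeps ties, a summary in which most companies share the same count can return far more than ten rows, as the 122-row output above shows. One possible alternative, sketched here under the assumption that exactly ten rows are wanted, is `slice_max()` with `with_ties = FALSE`. The tibble `toy_stores` is made-up illustration data, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for Grocery_stores; the company names are made up
toy_stores <- tibble(
  Company = c("A Market", "A Market", "B Foods", "C Grocery", "D Mart")
)

top_companies <- toy_stores %>%
  count(Company, name = "Count") %>%            # same as group_by() + summarize(n())
  slice_max(Count, n = 2, with_ties = FALSE)    # exactly n rows, even when counts tie

top_companies
```

With `with_ties = FALSE`, tied rows are cut off in the order they appear, so which of the one-store companies are kept is arbitrary. That trade-off is worth stating whenever such a chart is labeled a "top 10".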
The plot_ly() function from the plotly package was used. The data was the summarized store information, with store counts on the x-axis and company names on the y-axis. A horizontal bar plot was chosen because it makes store counts easy to compare across companies. The layout() function customizes the plot’s appearance: the title “Number of Stores by Company” states the plot’s purpose, the x-axis and y-axis are labeled “Count” and “Company,” respectively, and the background and font colors were set for readability.
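To sort the bars by length regardless of the row order of the data, plotly's `categoryorder` axis attribute can be set inside `layout()`. The following is a minimal sketch with a made-up summary table (`toy_summary` and its counts are hypothetical), not the chart from this report.

```r
library(plotly)

# Hypothetical summary table; the counts are made up
toy_summary <- data.frame(
  Company = c("A Market", "B Foods", "C Grocery"),
  Count   = c(2, 5, 1)
)

p <- plot_ly(
  data = toy_summary, x = ~Count, y = ~Company,
  type = "bar", orientation = "h"
) %>%
  layout(
    title = "Number of Stores by Company",
    xaxis = list(title = "Count"),
    # Order the categories by bar length so the longest bar ends up at the top
    yaxis = list(title = "Company", categoryorder = "total ascending")
  )

p
```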
The resulting visualization lets viewers compare store counts across companies by bar length. In this dataset, 122 company names account for the 123 stores, so only Aldi Food Store has two stores and every other bar has length one. Because top_n(10) keeps ties, all 122 companies appear in the chart rather than ten. The chart therefore shows that no single company name dominates, rather than revealing a few major players. To further explore the data, additional approaches could be employed. For instance, a map visualization could show the geographic distribution of these grocery stores and where they are concentrated. Additionally, if dated records were available, a time series analysis could track the growth or decline of stores over a specific period.
The plot applies several principles of data visualization and design:
Simplified and clear representation: The plot uses a simple horizontal bar chart to represent the data accurately. The use of a single chart type makes it easy for viewers to compare companies.
Effective use of color and layout: The plot uses a light gray background and black font color to ensure readability. The bars share a single color, avoiding extra colors that could distract or confuse viewers.
Clear labeling and titles: The plot includes clear labels for the x-axis and y-axis, providing context for the data being presented. The title states the purpose of the visualization, which is to show the number of stores by company.
# Summarize the data by location
location <- Grocery_stores %>%
  drop_na(Better_Lat, Better_Long) %>%
  group_by(Better_Lat, Better_Long) %>%
  summarize(Count = n())
## `summarise()` has grouped output by 'Better_Lat'. You can override using the
## `.groups` argument.
location
## # A tibble: 59 × 3
## # Groups: Better_Lat [59]
## Better_Lat Better_Long Count
## <dbl> <dbl> <int>
## 1 42.3 -83.1 1
## 2 42.3 -83.1 1
## 3 42.3 -83.1 1
## 4 42.3 -83.1 1
## 5 42.3 -83.1 1
## 6 42.3 -83.1 1
## 7 42.3 -83.1 1
## 8 42.3 -83.1 1
## 9 42.3 -83.1 1
## 10 42.3 -83.1 1
## # ℹ 49 more rows
# Read the shapefile
shapefile_path <- "C:/github/dataviz_final_project/data/Grocery_Stores-shp"
shapefile <- st_read(shapefile_path)
## Reading layer `f153c201-8c92-410b-97ac-ee696e90c2e8202041-1-plts0l.jxh6d' from data source `C:\github\dataviz_final_project\data\Grocery_Stores-shp' using driver `ESRI Shapefile'
## Simple feature collection with 123 features and 20 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 13417720 ymin: 285038.9 xmax: 13512010 ymax: 348893.5
## Projected CRS: NAD83(HARN) / Michigan South (ft)
shapefile
## Simple feature collection with 123 features and 20 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 13417720 ymin: 285038.9 xmax: 13512010 ymax: 348893.5
## Projected CRS: NAD83(HARN) / Michigan South (ft)
## First 10 features:
## OBJECTID Company Address City State
## 1 1 NAFSU ENTERPRISES INC 10320 PLYMOUTH RD DETROIT MI
## 2 2 RED FOX FOODLAND 10333 FENKELL ST DETROIT MI
## 3 3 X Z INC 11100 MACK AVE DETROIT MI
## 4 4 UNIVERSITY FOOD CENTER INC 1131 W WARREN AVE DETROIT MI
## 5 5 SAVE A LOT 11825 WOODWARD AVE DETROIT MI
## 6 6 MR CS SUPERMARKET INC 12055 MORANG DR DETROIT MI
## 7 7 BISHR POULTRY & FOOD CENTER 12300 CONANT ST DETROIT MI
## 8 8 GOLDEN BENGAL SEAFOOD INC 12500 KLINGER ST DETROIT MI
## 9 9 BASHAR & MARK BROTHERS MKT INC 12740 GRATIOT AVE DETROIT MI
## 10 10 GRAND PRICE INC 12955 GRAND RIVER AVE DETROIT MI
## ZipCode Better_Lat Better_Lon SquareFeet Common_Nam
## 1 48204 42.37365 -83.16277 15879.452 Shop A Lot Food Center
## 2 48238 42.40218 -83.16338 16130.741 Red Fox Foodland
## 3 48214 42.37783 -82.98089 14160.291 <NA>
## 4 48201 42.35320 -83.07440 22100.822 University Foods
## 5 48203 42.39370 -83.08930 16124.321 Save A Lot
## 6 48224 42.42660 -82.95160 11388.136 Morang Super Market
## 7 48212 42.40951 -83.05703 5661.150 Bishr Poultry & Food Center
## 8 48212 42.41067 -83.06024 1857.125 Shukriya Market
## 9 48205 42.41950 -82.98770 35373.495 Mazen Foods
## 10 48227 42.38160 -83.17120 13626.768 Grand Price Food Center
## Notes PHONE
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 Closed. Building for sale and phone off. - RL <NA>
## 4 <NA> (313) 833-0815
## 5 Not in Detroit? Seems to be in Highland Park RL <NA>
## 6 <NA> <NA>
## 7 <NA> <NA>
## 8 <NA> <NA>
## 9 <NA> (313) 839-6400
## 10 <NA> (313) 934-1000
## FAX EMAIL WEBSITE
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 (313) 833-5648 nyaldoo@spartanstores.com http://universityfoodsmidtown.com/
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## 7 <NA> <NA> <NA>
## 8 <NA> <NA> <NA>
## 9 (313) 839-1322 <NA> http://mazenfoods.net/
## 10 (313) 934-3680 grandprice@svharbor.com <NA>
## DIG_MEMBER Data_Sourc Centroid_X Centroid_Y ORIG_FID
## 1 <NA> NETS/Devries -83.16289 42.37363 1
## 2 <NA> NETS/Devries -83.16346 42.40218 2
## 3 <NA> NETS/Devries -82.98041 42.37803 3
## 4 Yes NETS/Devries -83.07448 42.35264 4
## 5 <NA> NETS/Devries -83.08959 42.39383 5
## 6 <NA> NETS/Devries -82.95197 42.42664 6
## 7 <NA> NETS/Devries -83.05704 42.40947 7
## 8 <NA> NETS/Devries -83.06029 42.41066 8
## 9 Yes NETS/Devries -82.98731 42.41965 9
## 10 Yes NETS/Devries -83.17162 42.38153 10
## geometry
## 1 POINT (13448619 320702.4)
## 2 POINT (13448318 331101.2)
## 3 POINT (13497896 323063.2)
## 4 POINT (13472625 313408.4)
## 5 POINT (13468313 328352.5)
## 6 POINT (13505282 340901.6)
## 7 POINT (13477016 334187.9)
## 8 POINT (13476131 334607.2)
## 9 POINT (13495783 338199)
## 10 POINT (13446221 323546.8)
detroit_map <- leaflet() %>%
  setView(lng = -83.05, lat = 42.33, zoom = 10) %>%
  addProviderTiles("CartoDB.Positron") # Choose your desired tile provider
# Add the shapefile as markers
detroit_map <- detroit_map %>%
  addCircleMarkers(
    data = shapefile,
    lng = ~Better_Lon,
    lat = ~Better_Lat,
    fillColor = "red", # Customize the fill color
    fillOpacity = 0.6, # Customize the fill opacity
    radius = 4 # Customize the marker radius
  )
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
# Display the map
detroit_map
The map plots the store locations from the shapefile over the Detroit area. Each red marker is placed using a store’s Better_Lon and Better_Lat values. Leaflet warns that 64 of the 123 rows have missing or invalid coordinates and ignores them, so only 59 stores appear on the map. The map can be used to explore patterns in the spatial distribution of the stores that do have coordinates.
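The shapefile's geometry column lists a POINT for each of the 123 features, while `Better_Lat` and `Better_Lon` are missing for 64 of them. One possible way to map more stores is to reproject the geometry to longitude/latitude with `st_transform()` and let leaflet read the coordinates from it. The sketch below is self-contained and uses two stores from the printed rows above. `toy_points`, and Web Mercator (EPSG:3857) as a stand-in for the shapefile's projected CRS, are assumptions for illustration only.

```r
library(sf)
library(leaflet)

# Stand-in for the shapefile: two stores from the printed rows above
toy_points <- st_as_sf(
  data.frame(
    Company = c("NAFSU ENTERPRISES INC", "RED FOX FOODLAND"),
    lon = c(-83.16277, -83.16338),
    lat = c(42.37365, 42.40218)
  ),
  coords = c("lon", "lat"), crs = 4326
)

# Mimic a projected layer (Web Mercator is used only as a stand-in projected CRS)
toy_projected <- st_transform(toy_points, 3857)

# Reproject to WGS84 longitude/latitude, which leaflet expects
toy_wgs84 <- st_transform(toy_projected, 4326)

store_map <- leaflet(toy_wgs84) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    fillColor = "red", fillOpacity = 0.6, radius = 4,
    popup = ~Company  # show the company name when a marker is clicked
  )

store_map
```

With the real data, the equivalent step would be `st_transform(shapefile, 4326)` before passing the layer to leaflet. Whether every geometry is non-empty would still need to be checked.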
The original plan was to create a map visualization showing the distribution of grocery stores in Detroit. To prepare the data, I removed rows with missing latitude or longitude values, grouped the data by these coordinates, and counted the grocery stores at each location, producing the ‘location’ dataset. The map itself is drawn from the shapefile rather than from ‘location’.
The story that can be told with these plots revolves around the spatial distribution of grocery stores in Detroit. By visualizing the locations of grocery stores on a map, we can identify areas with higher concentrations of stores as well as areas with limited access to fresh food. This information could be valuable for understanding food deserts and areas that may need additional attention to improve food accessibility for residents.
One potential difficulty in the visualization process is data quality. The latitude and longitude values must be accurate and match the corresponding locations on the map, because outliers or incorrect entries could produce a misleading picture. Data validation and verification steps should therefore be performed to minimize such issues.
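A simple validation step is to flag rows whose coordinates are missing or fall outside a plausible bounding box. The sketch below is hedged: `toy_stores` is made-up data, and the box limits are assumptions chosen loosely around the coordinate ranges printed by `summary()` earlier.

```r
library(dplyr)

# Hypothetical stand-in for Grocery_stores; the values are made up
toy_stores <- tibble(
  Company     = c("A Market", "B Foods", "C Grocery", "D Mart"),
  Better_Lat  = c(42.37, NA, 42.40, 24.40),  # last value mimics a data-entry error
  Better_Long = c(-83.16, -83.10, -83.05, -83.00)
)

# Assumed bounding box, loosely based on the summary() ranges shown earlier
lat_range <- c(42.2, 42.5)
lon_range <- c(-83.3, -82.9)

checked <- toy_stores %>%
  mutate(
    coord_ok = !is.na(Better_Lat) & !is.na(Better_Long) &
      between(Better_Lat, lat_range[1], lat_range[2]) &
      between(Better_Long, lon_range[1], lon_range[2])
  )

# Rows that need a manual look before mapping
suspect_rows <- filter(checked, !coord_ok)
suspect_rows
```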
Additional Approaches for Data Exploration: There are additional approaches that can be employed to further explore the selected data.
Heatmap: Instead of individual markers, we can use a heatmap representation to visualize the density of grocery stores across different areas of Detroit. This would provide a more continuous representation of store distribution.
Clustering Analysis: Utilizing clustering algorithms such as K-means or DBSCAN could help identify distinct clusters of grocery stores. This approach can uncover patterns and spatial groupings that may not be immediately apparent in the raw data.
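Both ideas can be sketched with tools that ship with R. The example below uses simulated coordinates (`toy_coords` is hypothetical, scattered around two made-up centers) to compute a 2D kernel density grid with `MASS::kde2d()` as heatmap input and to run `kmeans()` with two clusters. It illustrates the techniques and is not an analysis of the real stores.

```r
library(MASS)  # kde2d(); MASS ships with standard R installations

set.seed(42)
# Hypothetical store coordinates scattered around two made-up centers
toy_coords <- data.frame(
  lon = c(rnorm(20, -83.15, 0.01), rnorm(20, -83.00, 0.01)),
  lat = c(rnorm(20,  42.37, 0.01), rnorm(20,  42.42, 0.01))
)

# Heatmap input: a 2D kernel density estimate on a 25 x 25 grid
density_grid <- kde2d(toy_coords$lon, toy_coords$lat, n = 25)
image(density_grid, xlab = "Longitude", ylab = "Latitude")

# Clustering: k-means with two clusters on the raw coordinates
# (treats degrees as planar distances, which is only a rough approximation)
km <- kmeans(toy_coords, centers = 2, nstart = 10)
table(km$cluster)
```

The number of clusters is an assumption here. With real data it would have to be chosen, for example by comparing the within-cluster sum of squares for several values of `centers`.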
We applied several principles of data visualization and design to create meaningful and informative plots:
Visual Encoding: We used position, color, and opacity to represent the grocery store locations. Every store is drawn as a red marker with the same fill opacity (0.6) and radius (4), so the markers emphasize where stores are located rather than differences between stores.
Interactive Elements: The map visualization created using the leaflet package allows interactive exploration. Users can zoom in and out and pan across the map. The markers do not define popups, so clicking a marker does not yet show additional information about a store.
Simplification and Focus: By focusing on the geographic representation of grocery store locations, we simplified the visual display to convey the main message clearly. Unnecessary elements were removed, ensuring the viewer’s attention is directed towards the key insights.
Consistency and Readability: A single color (red) is used for all markers, which stands out against the light CartoDB.Positron basemap. The marker size and opacity were chosen to keep the markers readable without overwhelming the map.
filtered_data <- Grocery_stores %>%
  group_by(Company) %>%
  summarize(Total_SquareFeet = sum(SquareFeet)) %>%
  top_n(30, Total_SquareFeet) %>%
  inner_join(Grocery_stores, by = "Company")
filtered_data
## # A tibble: 31 × 21
## Company Total_SquareFeet OBJECTID Address City State ZipCode Better_Lat
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Aldi Food S… 33435. 56 14708 … DETR… MI 48215 42.4
## 2 Aldi Food S… 33435. 57 15415 … DETR… MI 48205 42.4
## 3 Apollo Mark… 41262. 61 20250 … Detr… MI 48219 NA
## 4 Atlas Market 22064. 62 2645 W… Detr… MI 48238 NA
## 5 Azteca Supe… 24247. 63 2411 C… Detr… MI 48209 NA
## 6 BASHAR & MA… 35373. 9 12740 … DETR… MI 48205 42.4
## 7 E R & G INC 28660. 45 5800 C… DETR… MI 48212 42.4
## 8 FOOD PRIDE … 36025. 42 500 E … DETR… MI 48201 42.4
## 9 Family Fair… 27843. 68 700 Ch… Detr… MI 48207 NA
## 10 Family Food… 60050. 69 8665 R… Detr… MI 48206 NA
## # ℹ 21 more rows
## # ℹ 13 more variables: Better_Long <dbl>, SquareFeet <dbl>, Common_Name <chr>,
## # Notes <chr>, PHONE <chr>, FAX <chr>, EMAIL <chr>, WEBSITE <chr>,
## # DIG_MEMBER <chr>, Data_Source <chr>, Centroid_X <dbl>, Centroid_Y <dbl>,
## # ORIG_FID <dbl>
# Fit a linear regression model
# (Company is categorical, so lm() estimates one coefficient per company level)
model <- lm(SquareFeet ~ Company, data = filtered_data)
# Extract the coefficients from the model; coefficients[1] is "(Intercept)"
coefficients <- coef(model)
# Create a plot of the linear model and its associated coefficients
plot_model <- ggplot(filtered_data, aes(x = Company, y = SquareFeet)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Model: Square Feet by Company", x = "Company", y = "Square Feet") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() # the "+" on the previous line attaches the flip to the plot
plot_coefficients <- ggplot(filtered_data, aes(x = Company, y = coefficients[2] + coefficients[1] * seq_along(Company))) +
  geom_point(color = "red") +
  geom_abline(intercept = coefficients[2], slope = coefficients[1], color = "blue") +
  labs(title = "Coefficients Plot: Square Feet by Company", x = "Company", y = "Square Feet") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() # the "+" on the previous line attaches the flip to the plot
# Display both plots side by side
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(plot_model, plot_coefficients, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
The first plot, titled “Linear Model: Square Feet by Company,” shows each store as a point (geom_point() uses its default color, since none is set) together with a blue geom_smooth() layer that uses method “lm”. Because “Company” is a categorical variable with no natural order, the model does not estimate a single slope, so a positive or negative slope has no interpretation here. Instead, lm() estimates a separate level of square footage for each company.
The second plot, titled “Coefficients Plot: Square Feet by Company,” is built from the model coefficients. Each red point is the value coefficients[2] + coefficients[1] * i, where i is the row index of the store in filtered_data, and the blue line is drawn with intercept coefficients[2] and slope coefficients[1]. Note that in R, coef(model)[1] is the model intercept (the fitted square footage of the reference company) and each remaining coefficient is another company’s difference from that reference. The red points and the blue line therefore do not show one coefficient per company.
By examining these plots, one can compare square footage across companies. Because the predictor is categorical, the plots describe differences between companies rather than the strength or direction of a linear trend.
The charts I originally planned for this assignment were a scatter plot with a linear regression line relating the “SquareFeet” variable to the “Company” variable, and a plot of the linear model’s coefficients. To prepare the data, I summed the square footage for each company, kept the 30 companies with the largest totals, and joined the result back to the store records to form “filtered_data.” No further handling of missing values or outliers was performed.
In the first plot, I used ggplot to draw the “SquareFeet” variable against the “Company” variable and added a smoothing layer with the method “lm”. In the second plot, I used two coefficients from the linear regression model to compute values from each store’s row index, plotted these values against the “Company” variable, and added a line defined by the same two coefficients.
The story I could tell with these plots is how “SquareFeet” varies by “Company”. The scatter plot shows which companies operate larger or smaller stores.
The model’s coefficients describe how each company’s square footage differs from that of the reference company, which allows the companies to be compared.
One difficulty I encountered while creating the visualizations was ensuring that the data was in the correct format for plotting. I assumed that the data was already cleaned and prepared, but in a real-world scenario, data cleaning and preprocessing steps would be necessary to handle missing values and outliers and to ensure data integrity.
In terms of additional approaches to explore the data, we could add confidence intervals around the fitted values. This would provide a measure of uncertainty around the estimates and give a sense of the variability in the relationship.
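For a categorical predictor, confidence intervals can be obtained per company from `predict()` with `interval = "confidence"`. The sketch below uses made-up data (`toy_sizes`, with two hypothetical stores for each of three hypothetical companies). In `filtered_data` most companies have a single store, so the real model would have very few residual degrees of freedom and its intervals should be read with caution.

```r
# Hypothetical data: two stores for each of three made-up companies
toy_sizes <- data.frame(
  Company    = rep(c("A Market", "B Foods", "C Grocery"), each = 2),
  SquareFeet = c(10000, 12000, 20000, 24000, 15000, 15500)
)

fit <- lm(SquareFeet ~ Company, data = toy_sizes)

# One row per company, with the fitted mean and a 95% confidence interval
new_companies <- data.frame(Company = unique(toy_sizes$Company))
intervals <- cbind(
  new_companies,
  predict(fit, newdata = new_companies, interval = "confidence", level = 0.95)
)

intervals
```

The `lwr` and `upr` columns could then be drawn as error bars, for example with `geom_errorbar()` in ggplot2.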
To apply the principles of data visualization and design, I used the ggplot2 package in R to create the plots. I ensured that the plots had clear titles, axis labels, and a consistent theme. I also adjusted the angle and justification of the x-axis labels to improve readability. By using different colors for the regression line and the coefficient points, I made it easier to distinguish between the elements of the plots.
Overall, the visualizations give a compact view of how “SquareFeet” differs across the “Company” values in the filtered data.
The motivation behind the linear model explored in the code is to understand the relationship between the “SquareFeet” variable and the “Company” variable in the dataset. The code calculates the total square feet for each company, selects the top 30 companies with the highest total square feet, and then fits a linear regression model to examine how the square footage varies across different companies.
Linear regression is a commonly used statistical technique to model the relationship between a dependent variable and one or more independent variables. In this case, the dependent variable is “SquareFeet,” and the independent variable of interest is “Company.” Because “Company” is categorical, R encodes it as a set of indicator (dummy) variables, one for each company other than the reference company.
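A small worked example may make the coefficients of such a model easier to read. With R's default treatment coding, the intercept is the mean of the reference company and every other coefficient is that company's mean minus the reference mean. The data frame `toy_sizes` is made up for illustration.

```r
# Hypothetical data: two stores for each of three made-up companies
toy_sizes <- data.frame(
  Company    = rep(c("A Market", "B Foods", "C Grocery"), each = 2),
  SquareFeet = c(10000, 12000, 20000, 24000, 15000, 15500)
)

fit <- lm(SquareFeet ~ Company, data = toy_sizes)
coefs <- coef(fit)

# Group means computed directly, for comparison with the coefficients
group_means <- tapply(toy_sizes$SquareFeet, toy_sizes$Company, mean)
group_means

# The intercept equals the mean of the reference company (first level alphabetically)
coefs["(Intercept)"]
# Each remaining coefficient is that company's mean minus the reference mean
coefs["CompanyB Foods"]
coefs["CompanyC Grocery"]
```

Here the group means are 11000, 22000, and 15250, so the intercept is 11000 and the two remaining coefficients are 11000 and 4250.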
As for the attributes that would be appropriate when trying to predict the values of “SquareFeet,” it depends on the specific problem and context. In the given code, only the “Company” variable is considered as the independent variable. However, there may be other attributes or variables in the dataset that could potentially influence the square footage of a grocery store, such as location, store type, demographics, or other factors related to the company or market.